About the project

Course description

This is a course on Open Data Science.

Github
My GitHub repository can be found here.


Regression and model validation

Reading the data into R

Load GGally (and ggplot2), set the working directory to the IODS folder and read in the local tab-separated copy of the file.

library(GGally)
## Loading required package: ggplot2
setwd("~/Work/IODS-project")
analysis2014 <- read.table("data/analysis2014.txt", sep="\t", header=TRUE)

The structure and the dimensions of the data

str(analysis2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
dim(analysis2014)
## [1] 166   7

The data has 166 rows and 7 columns. The columns show the age, gender, attitude and exam points of the 166 students.
In addition, the data frame has three combined scores for the deep, surface and strategic learning approaches.

Data overview

ggpairs(analysis2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

summary(analysis2014)
##       Age           Attitude         Points      gender       deep      
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   F:110   Min.   :1.583  
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   M: 56   1st Qu.:3.333  
##  Median :22.00   Median :32.00   Median :23.00           Median :3.667  
##  Mean   :25.51   Mean   :31.43   Mean   :22.72           Mean   :3.680  
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75           3rd Qu.:4.083  
##  Max.   :55.00   Max.   :50.00   Max.   :33.00           Max.   :4.917  
##       surf            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

Most of the variables are close to normally distributed; the exception is age, whose right-skewed distribution is closer to a Poisson. There are roughly twice as many women as men in the data set. The strongest correlation is between Attitude and Points.
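That correlation claim can be checked numerically with cor(). A minimal sketch below uses simulated stand-in data, since the point is the pattern; with the real data you would pass the numeric columns of analysis2014 directly, e.g. cor(analysis2014[, c("Age", "Attitude", "Points", "deep", "surf", "stra")]).

```r
# Sketch: checking a pairwise correlation numerically.
# `d` is simulated stand-in data (not the real analysis2014).
set.seed(1)
d <- data.frame(Attitude = rnorm(166, mean = 31, sd = 7))
d$Points <- 13 + 0.35 * d$Attitude + rnorm(166, sd = 5)
round(cor(d), 2)
```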

Regression model

lm1 <- lm(Points~Age+gender+Attitude, data=analysis2014)
summary(lm1)
## 
## Call:
## lm(formula = Points ~ Age + gender + Attitude, data = analysis2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Age         -0.07586    0.05367  -1.414    0.159    
## genderM     -0.33054    0.91934  -0.360    0.720    
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08
lm2 <- lm(Points~Attitude, data=analysis2014)
summary(lm2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysis2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

In the model with age, gender and attitude as explanatory variables, only attitude had a statistically significant effect on the exam points; age and gender did not. In the final model, attitude has a significant positive association with points.
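As a quick sanity check of the final model, the fitted coefficients can be used to compute a prediction by hand, here for a student with the median attitude score of 32 (intercept and slope copied from the lm2 summary above):

```r
# Predicted exam points for Attitude = 32, using the lm2 coefficients
# (intercept 11.63715, slope 0.35255 from the summary above)
11.63715 + 0.35255 * 32   # ~22.9, close to the median Points of 23
```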

Model diagnostics

par(mfrow=c(2,2))
plot(lm2, which = c(1,2,5))

The diagnostic plots for the second model show no clear pattern in the residuals, so the constant-variance assumption holds. The Q-Q plot shows that the residuals are fairly normally distributed, apart from a few possible outliers. The residuals-vs-leverage plot likewise points to a few observations that could be outliers in this data set.
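The visual impression of a few influential points can also be checked numerically with Cook's distance. The sketch below uses a small simulated fit purely for illustration; with the real data you would call cooks.distance(lm2) directly.

```r
# Sketch: flagging potentially influential observations with Cook's distance
# (simulated fit; with the real data use cooks.distance(lm2))
set.seed(42)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
cd <- cooks.distance(fit)
which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff
```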


Logistic regression

The libraries needed in the analyses

library(GGally)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read in the data

We start by reading in the data that was saved after the data wrangling part.

alc <- read.table("data/alc_data.txt", sep="\t", header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data consists of different attributes of students from two Portuguese schools. Some of the records in the two sets come from the same students, and the data sets have been joined based on their answers; only the students present in both questionnaires are included.

Choose the variables and explore them

The four variables I think could be linked with high or low alcohol consumption are:

alc_rf <- randomForest(high_use ~ ., data=alc)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
varImpPlot(alc_rf)

  • sex (males tend to drink more than females)
  • absences (the more you drink, the more you might be absent)
  • G3, the final grade (alcohol consumption could affect the final grade from the course)
  • goout, going out with friends (what else would you do with friends)
ggpairs(alc[,c("high_use", "sex", "absences", "G3", "goout")],  aes(col=high_use, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

table(high_use = alc$high_use, sex = alc$sex)
##         sex
## high_use   F   M
##    FALSE 156 112
##    TRUE   42  72
boxplot(absences~high_use, data=alc, ylab="Number of absences", xlab="High use")

boxplot(G3~high_use, data=alc, ylab="Final grade", xlab="High use")

boxplot(goout~high_use, data=alc, ylab="Number of times going out w/ friends", xlab="High use")

There seem to be slightly more heavy users among males. The heavy users also have more absences, but their grades do not differ. The strongest association is probably with the number of times going out with friends: the more you go out, the more likely you are a heavy user.
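The association between sex and high use can also be tested formally. A chi-squared test of independence on the counts printed above (copied by hand here) suggests the difference between the sexes is unlikely to be due to chance:

```r
# Chi-squared test of independence on the sex x high_use counts above
tab <- matrix(c(156, 112,
                 42,  72), nrow = 2, byrow = TRUE,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              sex = c("F", "M")))
chisq.test(tab)
```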

Logistic model

m1 <- glm(high_use ~ sex + absences + G3 + goout, data = alc, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ sex + absences + G3 + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9286  -0.8055  -0.5308   0.8264   2.4649  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.57370    0.68720  -5.200 1.99e-07 ***
## sexM         0.95202    0.25496   3.734 0.000188 ***
## absences     0.08215    0.02237   3.672 0.000241 ***
## G3          -0.04453    0.03883  -1.147 0.251571    
## goout        0.70844    0.12065   5.872 4.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 386.44  on 377  degrees of freedom
## AIC: 396.44
## 
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept)        sexM    absences          G3       goout 
## -3.57369722  0.95202361  0.08214567 -0.04452530  0.70844342
OR <- coef(m1) %>% exp()
CI <- confint(m1) %>% exp()
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR       2.5 %    97.5 %
## (Intercept) 0.02805195 0.006927766 0.1031581
## sexM        2.59094742 1.581095123 4.3051892
## absences    1.08561394 1.040315244 1.1370409
## G3          0.95645140 0.886134749 1.0323272
## goout       2.03082765 1.612303135 2.5901641

From the model summary and the odds ratios with their confidence intervals we can see that sex, the number of absences and the number of times going out with friends are all associated with high use of alcohol.
Males are more likely to be high users, and both more absences and more evenings out with friends increase the odds of high use. The final grade (G3) is not significant, as its confidence interval includes 1.
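The odds ratios above are just the exponentiated coefficients; for example, the sexM estimate from the model summary converts to its odds ratio as:

```r
# Odds ratio for sexM from its logistic coefficient (estimate 0.95202)
exp(0.95202)   # matches the OR of ~2.59 in the table above
```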

Refined model and the predictive power

m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)

A cross tabulation of the predictions against the true values, and a graphical visualization of them.

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
ggplot(alc, aes(y = high_use, x = probability, col=prediction)) + geom_point()

From the cross tabulation we can see that the training accuracy is close to 0.79, meaning ~21 % of the students are wrongly classified. Next we can compare the predictive power to a simple guessing strategy.
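That accuracy figure is just the sum of the diagonal of the proportion table above (values copied by hand):

```r
# Training accuracy = sum of the diagonal of the proportion table above
0.66230366 + 0.12827225   # ~0.791, i.e. ~20.9 % training error
```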

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(alc$high_use, prob=0)
## [1] 0.2984293
loss_func(alc$high_use, alc$probability)
## [1] 0.2094241

We can see that the model performs better than simply guessing the larger class (~0.21 error vs. ~0.30 error).
To get a less optimistic estimate of the error, we can perform 10-fold cross-validation.

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
## [1] 0.2251309

We can see that this model performed better than the one introduced in DataCamp (~0.26 error).
Hooray for that!


Clustering and classification

Load the data

Load the data from the MASS package and explore the structure and dimensions

library(dplyr)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset is a data frame with 506 rows and 14 columns. The columns describe housing values and related variables in the suburbs of Boston.

Graphical overview of data

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data contains both continuous and binary variables. Some of the variables are highly correlated, either negatively, such as lower status of the population (lstat) and median value of owner-occupied homes in $1000s (medv), or positively, such as full-value property-tax rate per $10,000 (tax) and proportion of non-retail business acres per town (indus).
However, the variables are not all normally distributed and do not share the same variance, which LDA assumes. Therefore we scale the variables.
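As a reminder of what the scaling step does: scale() with its defaults is just column-wise standardization, i.e. subtracting the column mean and dividing by the column standard deviation.

```r
# scale(x) with defaults is equivalent to (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)   # TRUE
```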

Standardization and categorize the crime rate

We scale the values to have a mean of 0 and standard deviation of 1.

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
apply(boston_scaled,2, sd, na.rm=TRUE)
##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##       1       1       1       1       1       1       1       1       1 
##     tax ptratio   black   lstat    medv 
##       1       1       1       1       1
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Then we use the quantile function to get the break points and cut the crime rate into 4 categories. We remove the original crime rate variable and substitute it with the new categorical variable.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
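The quantile-based binning pattern used above can be illustrated on a toy vector, where the quartile breaks give four equally sized bins by construction:

```r
# Quantile binning with cut(): quartile breaks give four equal bins
v <- 1:100
b <- quantile(v)
f <- cut(v, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)   # 25 observations in each bin
```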

After that we divide the data into train and test sets, with 80 % of the observations going to the train set.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear discriminant analysis

lda.fit <- lda(crime~., data=train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen=2, col=classes)

Predictions from the LDA

First we save the correct classes of the test set and remove the crime variable from the data. Then we predict the crime categories in the test set using the model fitted on the train set and cross-tabulate the predictions with the correct classes.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata=test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       7        1    0
##   med_low    7      17        4    0
##   med_high   0      10       14    0
##   high       0       0        0   29

The model predicted the crime categories quite well: the high category is classified perfectly, while the middle categories get mixed up somewhat. With fewer classes, e.g. 3, the result would probably be even better.
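The overall accuracy of this particular run can be computed from the cross-tabulation above (counts copied by hand; note that without set.seed() the random split, and hence the exact table, changes between runs):

```r
# Accuracy = correct / total, rows = correct classes, cols = predicted
cm <- matrix(c(13,  7,  1,  0,
                7, 17,  4,  0,
                0, 10, 14,  0,
                0,  0,  0, 29), nrow = 4, byrow = TRUE)
sum(diag(cm)) / sum(cm)   # ~0.72
```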

Euclidean distance and k-means clustering

data("Boston")
boston_scaled <- scale(Boston)
boston_euk_dist <- dist(boston_scaled)
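dist() uses the Euclidean metric by default; for two points it reduces to the familiar square root of the summed squared differences:

```r
# Euclidean distance between two points with dist()
a <- c(0, 0)
b <- c(3, 4)
as.numeric(dist(rbind(a, b)))   # 5, i.e. sqrt(3^2 + 4^2)
```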

Then we run k-means clustering with 3 clusters and visualize the result for a few relevant variables.

km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[,6:10], col = km$cluster)

It seems that 3 clusters might not be optimal, so we can investigate the number of clusters by plotting the total within-cluster sum of squares (TWCSS) as a function of k.

set.seed(123)

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

The TWCSS drops most sharply when moving from 1 to 2 clusters, so 2 seems to be the optimal number. Let's visualize that with the same variables.

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,6:10], col = km$cluster)

Looks better.